Sprite 1984

Text File | 1992-12-14 | 6.8 KB | 202 lines

.uh "How to Boot Allspice"
.pp
I am ``Allspice'', Sprite's root file server.
After a power-up, try
.(l
>b sd()new
.)l
to boot from disk.
If this hangs or doesn't work for some reason, you may have
to do a network boot from ginger:
.(l
>b ie(0,961c,43)sun4.md/new
.)l
If you get a ``phase error'' when booting off the disk, you need to
reset the bus and try again.
To reset the bus, try booting from a non-existent disk, e.g.,
.(l
>b sd(0,6)new
.)l
If you don't want allspice to be the root server, use the -backup flag:
.(l
>b ie(0,961c,43)sun4.md/new -backup
.)l
To reboot when running Sprite, use the shutdown command.
.(l
% sync
% shutdown -R 'ie(0,961c,43)sun4.md/new'
.)l
The ``sync'' command writes out the cache; it isn't required unless
you are paranoid.
Shutdown will sync the disks as the last thing before rebooting.
.pp
If Allspice is too wedged to get things done with user commands,
then sync the disks with:
.(l
break-W
.)l
This should print a message about queuing a call to sync the disks,
and when it is done it should print a ``.'' and a newline.
If you don't get the newline then Allspice is deadlocked inside the
file system cache, sigh.
.pp
(Note: on a regular Sun keyboard, this would be L1-W, and you'd
use the L1 key like a shift key.
On a regular ASCII terminal, like Allspice's console, you use the
break key like escape: break then W.)
.pp
You can abort Allspice with:
.(l
break-A
.)l
And then use the boot command described above.
.uh "Debugging Tips"
.pp
The current procedure for an Allspice crash is to take a core dump
and then reboot.
.pp
If Allspice acts up then you might try the following things.
If you aren't logged in, log in as root.
Useful commands are:
.(l
allspice # rpcstat -srvr
.)l
This dumps out the status of all the RPC server processes.
If a bunch are ``busy'', and they remain busy with the same
RPC ID and client, then there may be a deadlock.
If they are all in the ``wait'' state it means that the Rpc_Daemon
process is not doing rebinding for some reason.
.(l
allspice # ps -a
.)l
This will tell you if any important daemons have died.
In particular, verify that arpd, ipServer, portmap, unfsd, inetd,
tftpd, bootp, lpd, and sendmail are still around.
If the ipServer is in the DEBUG state you can kill it and the daemons
that depend on it with /hosts/allspice/restartIPServer.
.(l
% rpcecho -h \fIhostname\fP -n 1000
.)l
This program, which is found in /sprite/src/benchmarks/rpcecho and
may or may not be installed in /sprite/cmds, will tell you if there
are timeouts when using the RPC protocol to talk to another host.
If you suspect that a host with an Intel ethernet interface is flaking
out, you can try this command.
Lots of timeouts indicate trouble.
You can reset a host's network interface from its console with
.(l
break-N
.)l
.pp
If RPCs to Allspice are hanging but there's no obvious sign of trouble,
the problem might be that the timer queue is wedged.
To verify that this is the problem, type
.(l
break-T
.)l
This will give the current time (as a number) and the times at which
elements of the timer queue are due to be processed, sorted
by increasing time.
You may need to use ctrl-S to freeze the display (use ctrl-Q to
unfreeze it).
If the current time is greater than the earliest element in the timer
queue, the timer is wedged and needs prodding.
To prod the timer, type
.(l
break-A
.)l
to go to the PROM monitor, and then type ``c'' to continue back to
Sprite.
If this fails to unwedge the timer, you should reboot.
.uh "Kernel Debugging"
.pp
If Allspice is so hung you can't explore with user commands,
then the best you can do is sync the disks with:
.(l
break-W
.)l
Then throw Allspice into the debugger with:
.(l
break-D
.)l
If this drops you into the monitor (the '>' prompt), you can still
get into the debugger by typing 'c' to the monitor.
You may have to do this twice.
You should eventually get a message about ``Entering the debugger...''.
.pp
You have to run the debugger from shallot or another sun4 unix machine,
unless there is a stand-alone Sprite machine available.
To log in to shallot, you can use ginger's console, next to you,
on top of ginger.
You should verify that Allspice is accessible by running
.(l
ginger% kmsg -v allspice
.)l
This should return the kernel version that Allspice is running.
If this times out then either Allspice isn't in the debugger or, more
likely, no one is responding to ARP requests for Allspice's IP address.
Run the setup-arp script that is in ~sprite bin:
.(l
ginger% setup-arp
.)l
Now rlogin to shallot and run the Sprite kernel debugger.
The kernel images should be copied to ginger:/home/ginger/sprite/kernels
(visible as /home/ginger/sprite/kernels on shallot), and their version
number should be evident in their name, e.g. sun4.1.065.
If not, you can run strings on the kernel images and grep for ``VERSION''.
.(l
shallot% strings /tmp/sprite/sun4.sprite | egrep VERSION
.)l
To run the kernel debugger (kgdb.sun4 is in ~sprite/cmds.sun4):
.(l
shallot% cd /home/ginger/sprite/kernels
shallot% Gdb sun4.\fIversion\fP
.)l
If the RPC system seems to be the problem, you can dump the trace
of recent RPCs by calling Rpc_PrintTrace(numRecs):
.(l
(kgdb) print Rpc_PrintTrace(50)
.)l
If there is a deadlock you can dump the process table:
.(l
(kgdb) print Proc_Dump()
.)l
or, if you want to look at only waiting processes:
.(l
(kgdb) print Proc_KDump()
.)l
You can switch from process to process and do stack backtraces by
using the 'pid' command.
You only need to specify the last two hex digits of the process ID.
If you only have a decimal ID, then you have to type the whole thing.
File system deadlocks usually center around locked handles.
When you find a process stuck in Fsutil_HandleFetch or Fsutil_HandleLock
you can try to find the culprit by looking at the *hdrPtr these guys
are waiting on.
There is a 'lockProcessID' in the hdrPtr that is really the address
of a Proc_ControlBlock.
You can print this out with something like:
.(l
(kgdb) print *(Proc_ControlBlock *)(hdrPtr->lockProcessID)
.)l
You can reboot Allspice from within kgdb with the reboot command.
.(l
(kgdb) reboot ie(0,9634)sun4.md/new
.)l
.uh "Taking a core dump"
.pp
Step 1) Make sure Allspice is in the debugger.
If not, put it in the debugger.
.pp
Step 2) Log in to ginger.
Go to a file system with more than 40 megabytes of free space, e.g.,
/home/ginger/cores (now /export1/cores).
.pp
Step 2.5) You might need to run ~sprite/cmds.sun3/setup-arp for
ginger to be able to talk with allspice.
.pp
Step 3) Run kgcore as follows:
"~sprite/cmds.sun3/kgcore -v allspice"
The -v is optional but I like it because it prints progress messages.
Note that this step can take several (about 5) minutes.
.pp
Step 4) Rename the file "vmcore" to something more meaningful,
e.g., with "mv vmcore vmcore.allspice.crash.11-21".
.pp
Step 5) Reboot allspice.
(~sprite/cmds.sun3/kmsg -R "sd()new" allspice)
.uh "Modify date"
.pp
These notes were last updated by Mike Kupfer on \*(td.
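.pp
As a quick reference for the ``Taking a core dump'' steps, the sketch
below collects the command lines for steps 2.5 through 5 in order.
It only builds and prints the strings; it does not run anything.
The function name core_dump_commands and its parameters are made up
for illustration.
The commands themselves are the ones given in the steps, and the
11-21 tag is the example from step 4.

```python
def core_dump_commands(host="allspice", tag="11-21",
                       cmds_dir="~sprite/cmds.sun3"):
    """Return the command lines for steps 2.5 through 5, in order.

    Hypothetical helper: nothing is executed, the strings are only
    assembled so an operator can read them off and type them on ginger.
    """
    return [
        # Step 2.5: make ginger answer ARP requests for the host.
        f"{cmds_dir}/setup-arp",
        # Step 3: pull the core image; -v prints progress messages.
        f"{cmds_dir}/kgcore -v {host}",
        # Step 4: give the core file a meaningful name.
        f"mv vmcore vmcore.{host}.crash.{tag}",
        # Step 5: reboot the host from disk.
        f'{cmds_dir}/kmsg -R "sd()new" {host}',
    ]


if __name__ == "__main__":
    for line in core_dump_commands():
        print(line)
```

Steps 1 and 2 (getting Allspice into the debugger and picking a file
system with enough free space) are left out because they need a human
at the console.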